In my opinion, before constructing any model, the first and most important step is exploratory data analysis (descriptive statistics).
Let's start by understanding the wine data!
As we have seen, in terms of central tendency:
- A-cultivars are relatively high
- B-cultivars are relatively low
- C-cultivars are relatively high

The following requires attention: for mg, although the values of B-cultivars tend to be low, there are some outliers higher than the values of A-cultivars and C-cultivars!
From the EDA, two things come to mind: even before we build a classification model, we can expect its performance to be reasonably good, and some variables should contribute significantly to the classification.
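The EDA above can be sketched in a few lines. The post's wine.data file is not shown, so this sketch uses scikit-learn's bundled copy of the same UCI wine data as a stand-in; column names follow that copy.

```python
# Minimal EDA sketch on scikit-learn's copy of the UCI wine data.
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)
df = wine.frame  # 13 chemical measurements plus a 'target' cultivar column

# Per-cultivar medians reveal the tendencies noted above
medians = df.groupby("target").median()
print(medians[["flavanoids", "proline", "hue"]])

# Flag magnesium outliers per cultivar with the usual 1.5*IQR rule
def iqr_outliers(s):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

for cultivar, grp in df.groupby("target"):
    out = iqr_outliers(grp["magnesium"])
    print(f"cultivar {cultivar}: {len(out)} magnesium outliers")
```

Boxplots of each variable by cultivar (e.g. `df.boxplot(column="magnesium", by="target")`) show the same outliers graphically.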
Moving to the next part: correlation between the chemical results and the cultivars.
In order to develop a favorable decision rule for classifying cultivars,
we need to compare the true error rates, estimated by leave-one-out cross-validation (CV), across all decision rules.
First, we construct all decision rules as follows; then we can combine the results for comparison.
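The comparison scheme can be sketched as follows. The original work was done in R; here scikit-learn stands in, and the three classifiers are illustrative choices, not the post's exact models.

```python
# Sketch: estimate each decision rule's true error rate by leave-one-out CV.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
rules = {
    "tree": DecisionTreeClassifier(random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}
loo = LeaveOneOut()
errors = {}
for name, clf in rules.items():
    acc = cross_val_score(clf, X, y, cv=loo).mean()
    errors[name] = 1.0 - acc
    print(f"{name}: estimated true error rate = {errors[name]:.3f}")
```

Leave-one-out CV fits each rule n times, holding out one observation per fit, so the error estimate is nearly unbiased at the cost of extra computation.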
The tree result shows that the tree does not necessarily need to be pruned, since it is not complex.
Moreover, the true error rate is 14%.
In the plot, the recommended cp values lie between 0.042 and 0.017, because they all fall below the horizontal line. Hence, we choose 0.042, because the resulting tree is simpler than with 0.017, and the true error rate is 15%.
Classification tree:
rpart(formula = class.id ~ ., data = wine.data, method = "class",
control = wine.control)
Variables actually used in tree construction:
[1] falvanoids hue od proline
Root node error: 101/168 = 0.60119
n= 168
CP nsplit rel error xerror xstd
1 0 0 1 1 0
2 0 1 1 1 0
3 0 2 0 0 0
4 0 3 0 0 0
5 0 4 0 0 0
Classification tree:
rpart(formula = class.id ~ ., data = wine.data, method = "class",
control = wine.control)
Variables actually used in tree construction:
[1] falvanoids od proline
Root node error: 101/168 = 0.60119
n= 168
CP nsplit rel error xerror xstd
1 0 0 1 1 0
2 0 1 1 1 0
3 0 2 0 0 0
4 0 3 0 0 0
The true error rate of LDA is 1.7%
A B C
A 55 0 0
B 1 65 1
C 0 1 45
The true error rate of QDA is 0.6%
A B C
A 55 0 0
B 1 66 0
C 0 0 46
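The leave-one-out confusion tables above can be reproduced in outline like this. QDA is shown; swapping in `LinearDiscriminantAnalysis` gives the LDA table. This is a sketch on scikit-learn's copy of the wine data, not the post's exact split.

```python
# Sketch: leave-one-out predictions, then the confusion matrix behind the
# reported error rate.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict

X, y = load_wine(return_X_y=True)
pred = cross_val_predict(QuadraticDiscriminantAnalysis(), X, y,
                         cv=LeaveOneOut())
cm = confusion_matrix(y, pred)          # rows = true class, cols = predicted
error_rate = 1.0 - (pred == y).mean()   # off-diagonal mass
print(cm)
print(f"true error rate = {error_rate:.3f}")
```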
After repeatedly performing the nearest-neighbor classifier with different values of K, the favorable K is 2 to 4.
k.choice ture.error.rate
1 1 0.00
2 2 0.13
3 3 0.14
4 4 0.17
5 5 0.22
6 6 0.23
7 7 0.21
8 8 0.23
9 9 0.24
10 10 0.21
11 11 0.23
12 12 0.23
13 13 0.24
14 14 0.26
15 15 0.27
16 16 0.27
17 17 0.27
18 18 0.27
19 19 0.27
20 20 0.27
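The scan over K can be sketched as below. Exact rates will differ from the table above, since they depend on feature scaling, tie-breaking, and the implementation; raw (unscaled) features are used here as an assumption.

```python
# Sketch: leave-one-out error rate of k-NN for K = 1..20.
from sklearn.datasets import load_wine
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
rates = {}
for k in range(1, 21):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                          cv=LeaveOneOut()).mean()
    rates[k] = 1.0 - acc
best_k = min(rates, key=rates.get)
print({k: round(r, 2) for k, r in rates.items()})
print("K with lowest LOO error:", best_k)
```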
We select both parameters (cost, gamma) from 0.1 to 1 under leave-one-out cross-validation, so 100 results come out.
The best parameter set is located at gamma = 0.1 and cost from 0.4 to 1, which all perform identically. We can conclude that the SVM achieves its best performance when gamma = 0.1 and cost is anywhere from 0.4 to 1.
c gamma acc
1 0.4 0.1 98
2 0.5 0.1 98
3 0.6 0.1 98
4 0.7 0.1 98
5 0.8 0.1 98
6 0.9 0.1 98
7 1.0 0.1 98
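A sketch of the grid search follows. R's e1071 `svm` scales features by default, so a `StandardScaler` is added here to approximate that; the grid is coarser than the post's 100-point one to keep the sketch quick, and `cost` in e1071 corresponds to scikit-learn's `C`.

```python
# Sketch: (cost, gamma) grid search for an RBF SVM under leave-one-out CV.
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [0.1, 0.4, 0.7, 1.0],      # coarse stand-in for 0.1..1
        "svc__gamma": [0.1, 0.4, 0.7, 1.0]}
search = GridSearchCV(model, grid, cv=LeaveOneOut())
search.fit(X, y)
print("best parameters:", search.best_params_)
print(f"best LOO accuracy: {search.best_score_:.3f}")
```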
Overall, QDA and SVM achieve the lowest true error rates.
We use bagging to improve the prediction obtained from the pruned tree and to reduce the error that comes from sampling the training data.
We find that the bagging result differs from that of the original pruned tree;
that is, the bagging method improves predictive ability.
As a result, the predictions with bagging are slightly different from the predictions without bagging.
method result
1 Bagging BBBBCC
2 Boosting BBBCCA
3 Tree BBBBBC
4 LDA BBCBCB
5 QDA BBBBCB
6 NN CBBBCA
7 SVM ABBBBB
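The bagging comparison can be sketched as below. The `max_depth` cap stands in for the post's cp-based pruning; it is an assumption, not the post's actual rule.

```python
# Sketch: bag a shallow tree to damp variance from training-sample resampling,
# then compare leave-one-out error with and without bagging.
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
base = DecisionTreeClassifier(max_depth=3, random_state=0)
bag = BaggingClassifier(base, n_estimators=50, random_state=0)
errs = {}
for name, clf in [("pruned tree", base), ("bagged tree", bag)]:
    acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    errs[name] = 1.0 - acc
    print(f"{name}: estimated true error rate = {errs[name]:.3f}")
```

Each bagged tree is fit on a bootstrap resample of the training data, and the ensemble votes, which is exactly the variance-reduction idea described above.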
As a result, we can clearly identify three groups in wine.data, because the banner plot below can obviously be divided into three parts.
In order to examine the result of hierarchical clustering with Ward linkage,
we can check the agglomerative coefficient (AC) and whether the resulting groups agree with the original class.id.
[1] 0.94
[1] A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A B A A
[36] A B A A A A A A A A A A A A A A A A A A A A B B B B B B B B B B B B B
[71] B B B C C C B B B B B B B B B B B B B B B B B B B B B B B B B B B B B
[106] B B B B B B B B B B B B B B B B B B B C C C C C C C C C C C C C C B C
[141] C C C C C C C C C C C C C C C C C C C C C C C C C C C C
Levels: A B C
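The Ward clustering and an AC check can be sketched as follows. Standardizing the features first is an assumption, and the coefficient computed here is an agnes-style analogue (the mean of one minus each observation's first-merge height over the final merge height), not the output of R's cluster::agnes itself.

```python
# Sketch: Ward-linkage hierarchical clustering plus an agglomerative
# coefficient analogue.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale

X, y = load_wine(return_X_y=True)
Z = linkage(scale(X), method="ward")

# height at which each original observation is first merged into a cluster
n = X.shape[0]
first_merge = np.zeros(n)
for a, b, height, _ in Z:
    for idx in (int(a), int(b)):
        if idx < n:
            first_merge[idx] = height
ac = np.mean(1.0 - first_merge / Z[-1, 2])   # close to 1 = strong structure

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree at 3 groups
print(f"AC analogue = {ac:.2f}, cluster sizes = {np.bincount(labels)[1:]}")
```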
First, we need to choose the number of groups K before performing the K-means approach.
Several suggestions have been made as to how to choose the number of groups.
Using the simplest way to select K,
we can adopt the solution from hierarchical clustering: 3 groups.
The text labels in the plot are the result of k-means.
As a result, the small overlap between the three clusters and the original class.id indicates that k-means suits this data well.
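The overlap check can be sketched with a contingency table of clusters against cultivars. Standardizing the features and the random seed are assumptions.

```python
# Sketch: K-means with K = 3 (taken from the hierarchical solution), then a
# cross-tab against the true cultivars to quantify the overlap.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale

X, y = load_wine(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scale(X))

# contingency table: rows = true cultivar, columns = k-means cluster
table = np.zeros((3, 3), dtype=int)
for true, cl in zip(y, km.labels_):
    table[true, cl] += 1
print(table)
```

A good clustering shows one dominant cell per row; off-dominant counts are the overlapping observations.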
In the previous section (EDA) we found some outliers in this data,
so Partitioning Around Medoids (PAM), a more robust method, is worth trying.
An average silhouette width of 0.57 indicates a reasonable structure.
Namely, PAM is acceptable but not excellent!
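The silhouette check can be sketched as below. PAM itself is not in scikit-learn (scikit-learn-extra's `KMedoids` provides it), so the average silhouette width is computed on a K-means partition as a stand-in; standardized features are an assumption, and the value will differ from the 0.57 reported above.

```python
# Sketch: average silhouette width as the cluster-structure check used above.
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale

X, _ = load_wine(return_X_y=True)
Xs = scale(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)
width = silhouette_score(Xs, labels)  # in [-1, 1]; higher = better separated
print(f"average silhouette width = {width:.2f}")
```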
By the nature of SOM, the closer two nodes' colors are, the more similar they are; from the plot, we can see that the data can be split into three groups.
[1] 1 1 2 2 2 2 2 2 3 3 4 1 1 2 2 2 2 2 3 3 4 1 1 1 2 2 2 2 3 3 4 4 1 1 1
[36] 2 2 2 3 3 4 4 4 1 1 2 2 2 2 2 4 4 4 4 1 1 1 2 2 2 4 4 4 4 1 1 1 2 2 2
[71] 5 5 4 4 4 1 1 1 2 2 5 5 4 4 4 1 1 1 2 2 5 5 5 4 4 1 1 1 1 2
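A self-organizing map of this kind can be sketched from scratch. The post likely used an R SOM package; the grid size, learning rate, and decay schedules below are illustrative assumptions.

```python
# Sketch: a tiny from-scratch SOM that maps wine samples onto a 5x5 grid, so
# that nearby nodes hold similar wines.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import scale

X, _ = load_wine(return_X_y=True)
X = scale(X)
rng = np.random.default_rng(0)

H, W, n_iter = 5, 5, 2000
weights = rng.normal(size=(H, W, X.shape[1]))
coords = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), -1)

for t in range(n_iter):
    x = X[rng.integers(len(X))]
    # best-matching unit: node whose weight vector is closest to the sample
    bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), (H, W))
    lr = 0.5 * np.exp(-t / n_iter)        # decaying learning rate
    sigma = 2.0 * np.exp(-t / n_iter)     # shrinking neighborhood radius
    dist2 = ((coords - np.array(bmu)) ** 2).sum(-1)
    h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
    weights += lr * h * (x - weights)     # pull neighborhood toward sample

# map every sample to its best-matching unit
bmus = [np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), (H, W))
        for x in X]
print(bmus[:5])
```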